Abstract

Estimates for the variances of empirically determined scoring 
weights are given. It is also shown that test item writers should 
write distractors that discriminate on the criterion variable when 
this type of scoring is used. 



A NOTE ON THE VARIANCES OF EMPIRICALLY 
DERIVED OPTION SCORING WEIGHTS
Gary Echternacht 
Educational Testing Service 

In recent years, the developers of large-scale testing operations 
have shown an increasing interest in reducing the length of time 
examinees are required to spend on a given test. Reducing the test 
administration time would both reduce the cost of developing the test 
forms, as fewer items would be required, and allow time for additional 
tests to be administered. This thinking has characterized many of the 
test programs administered at Educational Testing Service, and, most 
likely, at other testing establishments. Researchers have thus sought 
new scoring methods that would result in increases in reliability due 
solely to the scoring system used. Thus, test length could be reduced, 
and a previous standard of reliability maintained. 

One such scoring method that has proven successful in reliability 
studies is that of empirically deriving scoring weights (Davis & Fifer, 
1959; Echternacht, 1973; Hendrickson, 1971; Reilly & Jackson, 1972; Strong, 
1943). If empirically derived scoring weights were to be adopted by 
such large-scale testing programs as the College Entrance Examination 
Board, the Graduate Record Examinations, the Law School Admission Test,
and other programs, one problem that would have to be faced is that of
determining the variances of the derived weights and the implications
these variances have for developing test items. This is necessary
because such tests are administered on repeated occasions, and the
scoring weights would be developed only on the initial administration.
Since some examinees would not be included in the initial scoring run,
the problem of scoring weight variance exists. Also, by knowing this
variance, the minimum number of examinees needed to develop the
weights, subject to a specified level of precision, can be determined.

There are a number of methods that can be used for deriving the
weights. The method discussed here is that used by Echternacht (1973),
which is the method used by Reilly and Jackson (1972) with no
iterations. Briefly, the method consists of assigning to each option
the average criterion score of the examinees selecting that option.
The criterion variable is standardized, so that its mean is zero and
its variance is one. The criterion usually used is the score on the
remaining items that make up the test, although this is certainly not
a necessary criterion.
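
The weighting method just described can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
function name, the response labels, and the criterion scores are
hypothetical, and the criterion is standardized within the group at hand
using the population standard deviation (the choice of divisor is an
assumption, not something the method prescribes).

```python
from collections import defaultdict
from statistics import mean, pstdev


def option_weights(responses, criterion):
    """Return {option: mean standardized criterion score of its choosers}."""
    mu = mean(criterion)
    sd = pstdev(criterion)  # assumption: divisor N rather than N - 1
    z = [(y - mu) / sd for y in criterion]

    groups = defaultdict(list)
    for option, score in zip(responses, z):
        groups[option].append(score)  # "omit" is treated as one more option

    return {option: mean(scores) for option, scores in groups.items()}


# Hypothetical data: one item, eight examinees.
responses = ["A", "B", "A", "C", "omit", "B", "A", "C"]
criterion = [12, 7, 15, 5, 4, 9, 14, 6]
weights = option_weights(responses, criterion)
```

Because the criterion has mean zero, the weights averaged over examinees
(each weight counted once per examinee choosing that option) also come
out to zero.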

Consider a population of $N$ people who will take a given test
at one point in time. Assume further that a simple random sample of
$n$ people from the population takes the test for the purpose of
determining scoring weights. Although this is not exactly true in an
operational setting, it does provide a useful approximation to reality.
Consider one item of that test. The scoring weight assigned to the
ith option of this item is

$$\bar{y}_{i\cdot} = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij} ,$$

where $n_i$ represents the number of people responding with the ith
option and $y_{ij}$ represents the criterion score for the jth person
choosing the ith option. In weighting options, the omit category
is considered another option, and a weight is also derived for it.
Since the criterion variable is assumed to be standardized,

$$\sum_{i=1}^{c} n_i \bar{y}_{i\cdot} / n = \bar{y}_{\cdot\cdot} = 0 ,$$

where $n = \sum_{i=1}^{c} n_i$, the number of people responding to the
item with one of the $c$ possible options. Using the standard result
for the variance of a mean obtained by simple random sampling from
a finite population, the variance of the ith option weight thus becomes

$$\operatorname{Var}(\bar{y}_{i\cdot}) = (1/n_i - 1/N_i) \, S_i^2 ,$$

where

$$S_i^2 = \frac{1}{N_i - 1} \sum_{j=1}^{N_i} (Y_{ij} - \bar{Y}_{i\cdot})^2$$

and $N_i$ indicates the number of examinees in the population
responding with option $i$. The problem becomes one of estimating
$S_i^2$. This is done by using the unbiased estimate

$$s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2 .$$

Such estimates of $S_i^2$ would presumably be obtained through
pretesting of the item.
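
As a rough illustration of how the variance formula might be used, the
sketch below estimates the variance of one option weight from sample
data and inverts the formula to find the smallest $n_i$ meeting a target
variance. The function names, the sample scores, and the population
count $N_i$ are hypothetical; in practice $N_i$ would itself have to be
estimated.

```python
import math
from statistics import variance


def weight_variance(sample_scores, N_i):
    """Estimated variance of an option weight: (1/n_i - 1/N_i) * s_i^2."""
    n_i = len(sample_scores)
    s2 = variance(sample_scores)  # unbiased estimate, divisor n_i - 1
    return (1.0 / n_i - 1.0 / N_i) * s2


def required_n(s2, N_i, target_variance):
    """Smallest n_i with (1/n_i - 1/N_i) * s2 <= target_variance."""
    # Rearranging gives n_i >= 1 / (target_variance / s2 + 1 / N_i).
    return math.ceil(1.0 / (target_variance / s2 + 1.0 / N_i))


# Hypothetical standardized criterion scores of five examinees who chose
# one option, out of an assumed 500 in the population choosing it.
scores = [0.2, -0.4, 0.9, 0.1, -0.3]
var_hat = weight_variance(scores, N_i=500)

# Hypothetical pretest estimate s_i^2 = 0.8 with N_i = 2000.
n_needed = required_n(s2=0.8, N_i=2000, target_variance=0.01)
```

The finite population correction makes the variance vanish when every
examinee choosing the option is in the sample, that is, when
$n_i = N_i$.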

Suppose the whole population of $N$ examinees is used for the
purpose of determining scoring weights, and the method previously
described is used. Now,

$$\bar{Y}_{\cdot\cdot} = \frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N_i} Y_{ij} = 0 ,$$

where $c$ indicates the number of response options. From the standard
algebraic identity for the analysis of variance, with
$N = \sum_{i=1}^{c} N_i$,

$$(N-1) S^2 = (N-1) = \sum_{i=1}^{c} \sum_{j=1}^{N_i} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2
            = \sum_{i=1}^{c} N_i \bar{Y}_{i\cdot}^2 + \sum_{i=1}^{c} (N_i - 1) S_i^2 . \qquad (1)$$
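
Identity (1) can be checked numerically. The following sketch uses a
small hypothetical population (the option labels and raw scores are
invented for illustration) and standardizes the criterion with divisor
$N - 1$, so that $S^2 = 1$ as the identity assumes.

```python
from statistics import mean, variance

# Hypothetical population: (option chosen, raw criterion score).
raw = [("A", 15), ("A", 12), ("A", 14),
       ("B", 9), ("B", 7), ("B", 10),
       ("C", 5), ("C", 6),
       ("omit", 4), ("omit", 3)]

scores = [y for _, y in raw]
m = mean(scores)
s = variance(scores) ** 0.5  # divisor N - 1, so S^2 = 1 after scaling

groups = {}
for option, y in raw:
    groups.setdefault(option, []).append((y - m) / s)

N = len(raw)
# Total sum of squares about the overall mean, which is zero.
total_ss = sum(z * z for zs in groups.values() for z in zs)
# Between-options term: sum of N_i * Ybar_i^2.
between_ss = sum(len(zs) * mean(zs) ** 2 for zs in groups.values())
# Within-options term: sum of (N_i - 1) * S_i^2.
within_ss = sum((len(zs) - 1) * variance(zs) for zs in groups.values())
```

Here `total_ss` equals $N - 1$ and also equals `between_ss + within_ss`,
up to floating-point rounding.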

If $1/N$ is negligible, (1) may be written as

$$1 = \sum_{i=1}^{c} W_i \bar{Y}_{i\cdot}^2 + \sum_{i=1}^{c} W_i S_i^2 , \qquad (2)$$

where

$$W_i = N_i / N ,$$

so that $\sum_{i=1}^{c} W_i = 1$ and

$$\bar{S}^2 = \sum_{i=1}^{c} W_i S_i^2 = \sum_{i=1}^{c} W_i (1 - \bar{Y}_{i\cdot}^2) ,$$

which indicates that the $S_i^2$ are not independent for all $c$
categories.

In obtaining empirically derived scoring weights, it is, of course,
desirable that the variance of the resulting weights be a minimum.
If a large enough pool of examinees is tested in the initial test
administration so that the $n_i$ are all large for each item, the
variances will likely be small. This is not always the case, though,
and it does not tell the item writer anything about how he should
write the items to help ensure that a small variance results. The item
writer can have some influence over both the $n_i$ and the $S_i^2$.
By increasing the $n_i$ and decreasing the $S_i^2$, the ith option
weight's variance will decrease. But the $n_i$ and $S_i^2$ are not
independent for a given item. Therefore, it seems reasonable to
consider minimizing $\bar{S}^2$ and the implications this minimization
has for item writers. One can see that $\bar{S}^2$ can be minimized by
making the between-options sum of squares,
$\sum_{i=1}^{c} N_i \bar{Y}_{i\cdot}^2$, a maximum.
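
The trade-off between the between-options sum of squares and the
within-option variances can be seen in a small numerical sketch. It
takes $\bar{S}^2$ to be the weighted average $\sum W_i S_i^2$ with
$W_i = N_i / N$ (an assumption about the notation), and compares two
hypothetical items answered by the same twelve examinees: one whose
options separate examinees by criterion level, and one whose options do
not.

```python
from statistics import mean, variance


def q_and_sbar2(groups):
    """Return (sum of N_i * Ybar_i^2, sum of W_i * S_i^2) for one item.

    `groups` holds one list of standardized criterion scores per option.
    """
    N = sum(len(g) for g in groups)
    q = sum(len(g) * mean(g) ** 2 for g in groups)
    sbar2 = sum(len(g) / N * variance(g) for g in groups)
    return q, sbar2


# Hypothetical population of twelve raw scores, standardized (divisor N - 1).
raw = list(range(1, 13))
m, s = mean(raw), variance(raw) ** 0.5
z = [(y - m) / s for y in raw]

# Item 1: each option attracts one band of the criterion.
discriminating = [z[0:4], z[4:8], z[8:12]]
# Item 2: each option attracts examinees from the whole range.
nondiscriminating = [z[0::3], z[1::3], z[2::3]]

q1, sbar2_1 = q_and_sbar2(discriminating)
q2, sbar2_2 = q_and_sbar2(nondiscriminating)
```

In this example the discriminating item has the larger between-options
sum of squares and the smaller average within-option variance.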

Although it is recognized that the following discussion is somewhat
esoteric for the item writer and the conditions presented very
unrealistic, it is an attempt to demonstrate some of the basic
principles that should be used in minimizing $\bar{S}^2$.

In maximizing $Q = \sum_{i=1}^{c} N_i \bar{Y}_{i\cdot}^2$, a few things
need to be noted.

In the case where $c = 2$, it can be easily shown that $Q$ attains a
minimum when $\sum_{j=1}^{N_1} Y_{1j} = 0$, or when each category mean
equals the overall mean. Also, if $\sum_{j=1}^{N_1} Y_{1j}$ can be
considered given and $Q$ a function of only the $N_i$'s, $Q$ is
minimized when $N_1 = N/2$. Since we are considering a finite
population, a maximum value of $Q$ is obtained when all positive
$Y_{ij}$ are found in one category and all negative $Y_{ij}$ in the
other. The zero values of $Y_{ij}$ are placed in the category with the
largest $N_i$.
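
The statement about $N_1 = N/2$ can be illustrated numerically. With two
options and a zero overall mean, the two option sums are equal and
opposite, so $Q = T^2 (1/N_1 + 1/N_2)$, where $T$ is the criterion sum
for the first option. The sketch below holds a hypothetical $T$ fixed
and scans the possible sizes; the function name and the numbers are
illustrative assumptions.

```python
def q_two_options(total_1, n_1, N):
    """Q for c = 2 when option 1 has criterion sum total_1 and size n_1.

    With a zero overall mean, option 2 has sum -total_1 and size N - n_1.
    """
    return total_1 ** 2 / n_1 + total_1 ** 2 / (N - n_1)


N = 40          # hypothetical population size
total_1 = 6.0   # hypothetical fixed criterion sum for option 1

# Evaluate Q for every admissible split of the population.
q_by_size = {n_1: q_two_options(total_1, n_1, N) for n_1 in range(1, N)}
n_min = min(q_by_size, key=q_by_size.get)
```

For these values the smallest $Q$ occurs at the even split, and $Q$
is zero for any split when the option sum is zero.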

In cases where $c > 2$, it can be shown that $Q$ is minimized when
$\sum_{j=1}^{N_i} Y_{ij} = 0$ for each $i$, or, if the sums
$\sum_{j=1}^{N_i} Y_{ij}$ are considered fixed, when the $N_i$ are
proportional to $\left| \sum_{j=1}^{N_i} Y_{ij} \right|$. Maximum
values can be obtained only when the criterion values can be partitioned
into nonoverlapping regions, with each region corresponding to a group
of people responding with a particular distractor. In topological
terms these regions are termed "connected" regions, and their union
consists of the entire criterion variable space. This is also the
case where each distractor can be used to place the individual
responding with that distractor in a categorization of the criterion.
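
The nonoverlapping-regions condition can be explored by brute force on a
small example. The sketch below enumerates every assignment of eight
hypothetical mean-zero criterion values to $c = 3$ options, finds the
assignment with the largest $Q$, and checks whether each option occupies
one unbroken stretch of the ordered criterion values. The data and
function names are illustrative assumptions, and exhaustive enumeration
is practical only for very small populations.

```python
from itertools import product


def between_ss(values, labels, c):
    """Q = sum over options of (option total)^2 / (option size).

    This equals the sum of N_i * Ybar_i^2 when the overall mean is zero.
    """
    q = 0.0
    for option in range(c):
        members = [v for v, lab in zip(values, labels) if lab == option]
        if members:  # an unused option contributes nothing
            q += sum(members) ** 2 / len(members)
    return q


def is_contiguous(values, labels):
    """True if each label forms a single run when values are sorted."""
    ordered = [lab for _, lab in sorted(zip(values, labels))]
    runs = [lab for k, lab in enumerate(ordered)
            if k == 0 or lab != ordered[k - 1]]
    return len(runs) == len(set(runs))


# Hypothetical criterion values with mean zero (distinct, for simplicity).
values = [-1.6, -1.1, -0.5, -0.2, 0.3, 0.6, 1.0, 1.5]
c = 3

best_labels = max(product(range(c), repeat=len(values)),
                  key=lambda labels: between_ss(values, labels, c))
best_is_contiguous = is_contiguous(values, best_labels)
```

For this example the maximizing assignment uses all three options and
places each on its own interval of the criterion.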

In practice, though, it is impossible for an item writer to
write items with the property previously noted. The item writer can,
however, structure distractors in such a way that examinees of
differing ability levels respond to different distractors. Such a
practice would tend to approximate the condition mentioned previously,
assuming that ability and the criterion are related, and allow $Q$ to
be maximized as much as is practical. The procedure of "facet design"
as set forth by Guttman (see Elizur, 1970) is one method that might be
used to so structure the distractors. In examining the results of item
pretesting, the quantity $Q$ should also be taken into consideration in
deciding whether or not to include a given item as part of a
test that will be scored using empirically derived option weights.